The radius $r$ must fall between the minimum and maximum distances between
the objects. The larger its value, the smaller the number of clusters. Other
criteria may be needed to select the most appropriate value (e.g., a prior
estimate of the likely number of clusters). The method of dynamic kernels is
analogous to hypersphere clustering.
The $K$-means method. This method originated from the so-called iterative self-
organizing data analysis (ISODATA) technique. The centres of $K$ clusters are chosen
simultaneously. Denoting the centre of the $k$th cluster by $Z_k$, $k = \overline{1, K}$, then for the
process of cluster formation, in particular for the incorporation of any object $X$ into
cluster $C_k$, we have
$$X \in C_k \quad \text{if} \quad \rho(X; Z_k) \le \rho(X; Z_i), \tag{13.5}$$
where $k = \overline{1, K}$, $i \ne k$. In the next step, new centres of gravity for the $K$ subclusters
are computed. At step $l$, for each new division $D_l$ the functional $F(D_l)$ is computed
by the expression
$$F(D_l) = \sum_{X \in C_{kl}} (X - Z_{kl})^2. \tag{13.6}$$
The optimal division is that for which the functional $F$ takes its minimal value. The
process of dividing continues until, for the centres of two successive steps, the condition
$$Z_{k,l+1} = Z_{kl} \tag{13.7}$$
is satisfied. The effectiveness of this algorithm depends on the chosen value of $K$,
the selection of the initial clustering centres, and the actual location of the points in
feature space corresponding to the objects, which together constitute a significant
weakness of this method.
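To make the iteration of (13.5)–(13.7) concrete, here is a minimal sketch in Python, assuming Euclidean distance as $\rho$ and initial centres drawn at random from the objects (the function names and parameters are illustrative, not from the text; the functional of (13.6) is read as accumulated over all $K$ subclusters of the division):

```python
import numpy as np

def k_means(X, K, rng=None):
    """Minimal K-means iteration: assign objects by (13.5), recompute the
    centres of gravity, and stop when two successive sets of centres
    coincide, as in (13.7). X is an (n_objects, n_features) array."""
    rng = np.random.default_rng(rng)
    # Choose K initial centres simultaneously, here by sampling K objects.
    Z = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    while True:
        # (13.5): incorporate each object into the cluster whose centre
        # is nearest to it.
        d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute the centres of gravity of the K subclusters
        # (an empty subcluster keeps its previous centre).
        Z_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else Z[k] for k in range(K)])
        # (13.7): stop when Z_{k,l+1} = Z_{kl} for every k.
        if np.allclose(Z_new, Z):
            return labels, Z_new
        Z = Z_new

def F(X, labels, Z):
    """The within-cluster functional of (13.6), accumulated over the
    K subclusters of the division."""
    return sum(((X[labels == k] - Z[k]) ** 2).sum() for k in range(len(Z)))
```

On data with well-separated groups this converges in a few iterations; but, as noted above, the result depends on the chosen $K$ and on the initial centres.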
Distance metrics. The calculation of a distance between any two objects is funda-
mental to clustering. In Euclidean space, the operation is intuitively straightforward,
especially when the positions of each object in space are represented using Cartesian
coordinates. Thus, in one dimension, the distance between two objects at positions
$x_1$ and $x_2$ is simply their difference, $|x_1 - x_2|$. The procedure is generalized to higher
dimensions using familiar knowledge of coördinate geometry (Pythagoras’ theorem);
thus, for two orthogonal axes $x$ and $y$, the distance is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
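As a concrete illustration, a sketch of this distance in Python, generalized to any number of orthogonal axes (the function name is illustrative):

```python
import math

def euclidean_distance(p, q):
    """Distance between two points given as equal-length coordinate tuples:
    sqrt((p1-q1)^2 + (p2-q2)^2 + ...), i.e. Pythagoras' theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((1.0,), (4.0,)))          # one dimension: |1 - 4| = 3.0
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # two dimensions: 5.0
```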
The space must be chosen according to relevance. Thus, a collection of trees might be
characterized by height and the mean rate of photosynthesis per unit area of leaf.
Each member of the collection (set) would correspond to a point in this space. An
explicit procedure must be provided for assigning numerical values to these two
parameters. Ex hypothesi, they are considered to be independent; hence, the axes are
orthogonal. Especially when the number of dimensions of the chosen space is high,
it is often desirable to reduce it to two, because of the convenience of representing
data as a two-dimensional picture. For this purpose, principal component analysis,
described in the next section (Sect. 13.2.2), is a useful method.
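Ahead of that description, a minimal sketch of such a reduction to two dimensions, assuming the standard eigendecomposition route to the principal components (the names are illustrative):

```python
import numpy as np

def reduce_to_2d(X):
    """Project an (n_objects, n_features) array onto its first two
    principal components, for display as a two-dimensional picture."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    return Xc @ top2                        # coordinates in the 2-D plot
```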